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Abstract 

In response to the need for learning tools tuned to big data analytics, the present paper 
introduces a framework for efficient clustering of huge sets of (possibly high-dimensional) data. 
Building on random sampling and consensus (RANSAC) ideas pursued earlier in a different 
(computer vision) context for robust regression, a suite of novel dimensionality- and set-reduction 
algorithms is developed. The advocated sketch-and-validate (SkeVa) family includes two algo¬ 
rithms that rely on A'-means clustering per iteration on reduced number of dimensions and/or 
feature vectors: The first operates in a batch fashion, while the second sequential one offers 
computational efficiency and suitability with streaming modes of operation. For clustering even 
nonlinearly separable vectors, the SkeVa family offers also a member based on user-selected 
kernel functions. Further trading off performance for reduced complexity, a fourth member 
of the SkeVa family is based on a divergence criterion for selecting proper minimal subsets of 
feature variables and vectors, thus bypassing the need for A'-means clustering per iteration. 
Extensive numerical tests on synthetic and real data sets highlight the potential of the proposed 
algorithms, and demonstrate their competitive performance relative to state-of-the-art random 
projection alternatives. 

Keywords. Clustering, high-dimensional data, variable selection, feature vector selection, 
sketching, validation, A'-means. 


1 Introduction 

As huge amounts of data are collected perpetually from communication, imaging, and mobile 
devices, medical and e-commerce platforms as well as social-networking sites, this is undoubtedly 
an era of data deluge m- Such “big data” however, come with “big challenges.” The sheer volume 
and dimensionality of data make it often impossible to run analytics and traditional inference 
methods using stand-alone processors, e.g., mm- Consequently, “workhorse” learning tools have 
to be re-examined in the face of today’s high-cardinality sets possibly comprising high-dimensional 
data. 

Clustering (a.k.a. unsupervised classification) refers to categorizing into groups unlabeled data 
encountered in the widespread applications of mining, compression, and learning tasks [4]. Among 
numerous clustering algorithms, A'-means is the most prominent one thanks to its simplicity |4j. 
It thrives on “tight” groups of feature vectors, data points or objects that can be separated via 
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(hyper)planes. Its scope is further broadened by the so-termed probabilistic and kernel A-means, 
with an instantiation of the latter being equivalent to spectral clustering - the popular tool for 
graph-partitioning that can cope even with nonlinearly separable data [12]. 

A key question with regards to clustering data sets of cardinality N containing D-dimensional 
vectors with N and/or D huge, is: How can one select the most “informative” vectors and/or 
dimensions so as to reduce their number for efficient computations, yet retain as much of their 
cluster-discrimination capability? This paper develops such an approach for big data R-means 
clustering. Albeit distinct, the inspiration comes from random sampling and consensus (RANSAC) 
methods, which were introduced for outlier-resilient regression tasks encountered in computer vi¬ 
sion ismaiEaEaisaES]. 

Feature selection is a rich topic m explored extensively from various angles, including pattern 
recognition, source coding and information theory, (combinatorial) optimization [311133] . and neural 
networks [6]. Unfortunately, most available selection schemes do not scale well with the number 
of features D, particularly in the big data regime where D is massive. Recent approaches to 
dimensionality reduction and clustering include subspace clustering, where minimization problems 
requiring singular value decompositions (SVDs) are solved per iteration to determine in parallel a 
low-dimensional latent space and corresponding sparse coefficients for efficient clustering [26j. Low- 
dimensional subspace representations are also pursued in the context of kernel K -means m Aig. 2 ], 
where either an SVD on a sub-sampled kernel matrix, or, the solution of a quadratic optimization 
task is required per iteration to cluster efficiently large-scale data. 

Randomized schemes select features with non-uniform probabilities that depend on the so- 
termed “leverage scores” of the data matrix [19[l20j. Unfortunately, their complexity is loaded by 
the leverage scores computation, which, similar to [26] . requires SVD computations - a cumbersome 
(if not impossible) task when D and/or N 3>. Recent computationally efficient alternatives for 
feature selection and clustering rely on random projections (RPs) [5ll9l fl9ll20] . Specifically for RP- 
based clustering [5j, the data matrix is left multiplied by a data-agnostic (fat) dx D RP matrix to 
reduce its dimension (d -C D ); see also [18] where RPs are employed to accelerate the kernel K- 
rneans algorithm. Clustering is performed afterwards on the reduced d-dimensional vectors. With 
its universality and quantifiable performance granted, this “one-RP-matrix-fits-all” approach is not 
as flexible to adapt to the data-specific attributes (e.g., low rank) that is typically present in big 
data. 

This paper’s approach aspires to not only account for structure, but also offer a gamut of novel 
randomized algorithms trading off complexity for clustering performance by developing a family 
of what we term sketching and validation (SkeVa) algorithms. The SkeVa family includes two 
algorithms based on efficient intermediate /v-means clustering steps. The first is a batch method, 
while the second is sequential thus offering computational efficiency and suitability for streaming 
modes of operation. The third member of the SkeVa family is kernel-based and enables big data 
clustering of even nonlinearly separable data sets. Finally, the fourth one bypasses the need for 
intermediate K -means clustering thus trading off performance for complexity. Extensive numerical 
validations on synthetic and real data-sets highlight the potential of the proposed methods, and 
demonstrate their competitive performance on clustering massive populations of high-dimensional 
data vs. state-of-the-art RP alternatives. 

Notation. Boldface uppercase (lowercase) letters indicate matrices (column vectors). Calligraphic 
uppercase letters denote sets (0 stands for the empty set), and \A\ expresses the cardinality of A. 
Operators || • ||2 and || • ||i stand for the 1 2 - and £i-norm of a vector, respectively, while (-) T denotes 
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transposition and 1 the all-one vector. 


2 Preliminaries 

Consider the D x N data matrix X := [x\,. .. ,xn] with D and/or N being potentially massive. 
Data {x n }n = x belong to a known number of K clusters (K <C N). Each cluster C k is represented by 
its centroid c k that can be e.g., the (sample) mean of the vectors in C k - Accordingly, each datum can 
be modeled as x n = Cir n + v n , where C := [ci,. .., ck\', the sparse I\ x 1 vector 7r n comprises the 
data-cluster association entries satisfying 'Yhk=i\' K n\k = 1) where [n n \ k G (0,1] when x n G C k , while 
[tv n ]k = 0 , otherwise; and the noise v n captures x n 's deviation from the corresponding centroid(s). 

For hard clustering, the said associations are binary ([7T n ]fc G {0,1}), and in the celebrated 
hard A'-means algorithm they are identified based on the Euclidean (£ 2 ) distance between x n and 
its closest centroid. Specifically, given K and {x n }^ = 1 , per iteration i = 1,2,..., the if-means 
algorithm iteratively updates data-cluster associations and cluster centroids as follows; see e.g., @]. 


[z-a] Update data-cluster associations: For n = 1,..., N, 


x n eC k [i]<^ke argmin ||£c n - c k ,[i]\\l . 


(la) 


[?-b] Update cluster centroids: For k = 1,..., AT, 



(lb) 


Although there may be multiple assignments solving (Hal) , each x n is assigned to a single cluster. To 
initialize (Hal) , one can randomly pick {cfc[l]}^ 1 from {* n }(/ =1 . The iterative algorithm (jT]) solves 
a challenging NP-hard problem, and albeit its success, A'-means guarantees convergence only to 
a local minimum at complexity O(NDKI), with I denoting the number of iterations needed for 
convergence, which depends on initialization [3J § 9.1]. 

Remark 1 . As (Hal) and (llbl) minimize an l^-norm squared, hard AT-means is sensitive to out¬ 
liers. Variants exhibiting robustness to outliers adopt non-Euclidean distances (a.k.a. dissimilarity 
metrics) 5, such as the £i-norm. In addition, candidate “centroids” can be selected per iteration 
from the data themselves; that is, c G C k [i\ in (llbl) . Notwithstanding for this so-termed K-medoids 
algorithm, one just needs the distances {6(x n , x n /)} to carry out the minimizations in (Hal) and 
The latter in turn allows {a: n }(/ =1 to even represent non-vectorial objects (a.k.a. qualitative 
data), so long as the aforementioned (non-)Euclidean distances can become available otherwise; 
e.g., in the form of correlations [H § 9.1]. 

Besides various distances and centroid options, hard A'-means in (JT|) can be generalized in three 
additional directions: (i) Using nonlinear functions tp : R D —> T~L, with H being a potentially inhnite- 
dimensional space, data {x n } can be transformed to {ip(x n )}, where clustering can be facilitated 
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(cf. Sec. 13.31) : (ii) via non-binary ir n £ [0,1] A , soft clustering can allow for multiple associations per 
datum, and thus for a probabilistic assignment of data to clusters; and (iii) additional constraints 
(e.g., sparsity) can be incorporated in the [7v n ] k coefficients through appropriate regularizers p( 7r). 
All these generalizations can be unified by replacing 0 per iteration i = 1,2,..., with 


[i- a] Update data-cluster associations: n = 1,... ,1V, 


K 


n n \i] £ argmin 5 <p(x n ), V] TT k c k [i] + p(n). 

7rg[0,l] K ; V k= 1 / 


1 1 7T=1 

[i-b] Update cluster centroids: 


(2a) 


{c k [i + l]}f = i € argmin 
{c fc }f=iC H 


N / K 

Y $ <p(x n ),Y knH] k C ^ 

n =1 V fc=l 


(2b) 


In Sec. 13.31 function will be implicitly used to map nonlinearly separable data {cc n }') r =] to linearly 
separable (possibly infinite dimensional) data {</?(a: n )} A =1 , whose distances can be obtained through 
a pre-selected (so-termed kernel) function k [41 Chap. 6]. The regularize!' p{7v) can enforce prior 
knowledge on the data-cluster association vectors. 

To confirm that indeed (JX[) is subsumed by ([2]), let <p{x n ) = x n ; choose 5 as the squared Euclidean 
distance in M A ; and set p( 7r) = 0, if tv £ £k '■= {^k} k =i-> whereas pfn) = +oc otherwise, with e k be¬ 
ing the kih A-dimensional canonical vector. Then (l2al) becomes 7r n [i] £ argmin^. e | 01 jK. 1 t 77=1 \\x n — 
Y, f; 7TfcCfc[z] || 2 , which further simplifies to the minimum distance rule of (Hal) . Moreover, it can be 
readily verified that (I2bl) {c k [i + 1]} A =1 £ argmin^ Y n W x n — W]fc c fell 2 separates across 

ks to yield the centroid of (llbl) per cluster. 

To recognize how (ED captures also soft clustering, consider that cluster C k is selected with 
probability Tt k '■= Pr (C k ), and its data are drawn from a probability density function (pdf) p 
parameterized by 9 k \ i.e., x\C k ~ p(x\9 k )- If p is Gaussian, then 6 k denotes its mean p, k and 
covariance matrix S k . With 6 := [9j,...,9j^] T and allowing for multiple cluster associations, 
the likelihood per datum is given by the mixture pdf: p(x-,7v,9) = Yk=i 7r fcP( a: ; &k)> which for 
independently drawn data yields the joint log-likelihood (II := [vri,... ,7T/v]) 


N 


In p(X-U,9) = Y^ E [7T n \kP(x n ]9 k ) 


n =1 


K 


k =1 


(3) 


If <p(x n ) := x n , c k := p(x n ;9 k ), and S(x n ,Yk 7r k c k) '■= ~ ln (Y,kl 7V ri\kP(x n -9 k )) in 0, then soft 
A"-means iterations (I2al) and (I2bl) maximize 0 with respect to (w.r.t.) II and 9. An alternative 
popular maximizer of the likelihood in 0 is the expectation-maximization algorithm; see e.g., [5J 
§ 9.3]. 

If the number of clusters K is unknown, it can also be estimated by regularizing the log-likelihood 
in 0 with terms penalizing complexity as in e.g., minimum description length criteria [3]. 

Although hard A'-means is the clustering module common to all numerical tests in Sec. V, the 
novel big data algorithms of Secs. Ill and IV apply to all schemes subsumed by 0. 
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3 The SkeVa Family 


Our novel algorithms based on random sketching and validation are introduced in this section. 
Relative to existing clustering schemes, their merits are pronounced when D and/or N take pro¬ 
hibitively large values for the unified iterations (12al) and (|2bj) to remain computationally feasible. 


3.1 Batch algorithm 

For specificity, the SkeVa K-means algorithm will be developed first for D 3>, followed by its variant 
for N 3>. 

Using a repeated trial-and-error approach, SkeVa K-means discovers a few dimensions (features) 
that yield high-accuracy clustering. The key idea is that upon sketching a small enough subset of 
dimensions (trial or sketching phase), a hypotheses test can be formed by augmenting the original 
subset with a second small subset (up to d affordable dimensions) to validate whether the first subset 
nominally represents the full D-dimensional data (error phase). Such a trial-and-error procedure is 
repeated for a number i? max of realizations, after which the features that have achieved the “best” 
clustering accuracy results are those determining the final clusters on the whole set of dimensions. 

Starting with the trial-phase per realization r, d dimensions (rows) of X are randomly drawn 


(uniformly) to obtain X^ := [aR 


M 




Mi 

N 


G 


pdxN 


With d small enough, iv-means is run on 


X^ to obtain clusters {C ^} k=l and corresponding centroids {c^}^ =1 [cf. (fTal) and (|lbj) ]. These 
sketching and clustering steps comprise the (random) sketching phase. 

Moving on to the error-phase of the procedure, the quality of the d-dimensional clustering is 
assessed next using what we term validation phase. This starts by re-drawing d'-dimensional data 
{in (d' <C D — d), generally different from those selected in draw r. Associating each Xn ^ 

with the cluster xh J belongs to, the centroids corresponding to the extra d dimensions are formed 
as [cf. (flb|) ] 


,(*■') 


cl = 




Mi 


E 


x 


M) 


(4) 




Let Xn := [x n ,Xn ; J and c y k := [c k ,c k j denote respectively the concatenated data 
and centroids from draws r and r', and likewise for the data and centroid matrices X^ and 
. Measuring distances {<5(®^, cjj^)} and using again the minimum distance rule data-cluster 
associations and clusters {C k ^} k=1 are obtained for the “augmented data.” If per datum x n the 
data-cluster association in the space of d dimensions coincides with that in the space of d := d + d! 
dimensions, then x n is in the validation set (VS) that is, 


yM — 

V D ■— 


j*n € C^\ x^ G cjfj, and h = k 2 } 


(5) 


Quality of clustering per draw is then assessed using a monotonically increasing rank function / of 
the set V D . Based on this function, a d-dimensional trial r\ is preferred over another d-dimensional 
trial r 2 if /(V^°) > /(v})' 2) ). 

The sketching and validation phases are repeated for a prescribed number of realizations R max . 
At last, the d-dimensional sketching r* := argmax^M ...,R max }/O^) yields the hnal clusters, 
namely {C k ^}{Ly see Alg. [TJ 
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Algorithm 1 Batch SkeVa K-means 

Input: Data X; number of clusters K ; reduced dimensions d and d! for the sketching and validation 
phases, respectively; ranking function /; number of realizations (draws) R max . 

Output: Data-cluster associations on X. 

1: for r = 1 to i?max do 

2: Randomly sample rows of X to obtain AR r ). 

3: Run A'-means on X ( r ); obtain clusters {cj! ) }[L 1 and centroids {c^}^ =1 [cf. dU)]. 

4: Randomly sample d! <C D rows of X (other than those in step [2]) to obtain X^ r '\ 

5: Per cluster k , form cjjT^ := [c^ T ,c^ ^ T ] T . 

6: Associate to closest • 

7: Identify validation set [cf. ©]. 

8 : end for 

9: r* := arg max re{1) Rmax} f(Vjy). 

10: Associate data to clusters on X as in v 


With regards to selecting /, a straightforward choice is the VS cardinality, that is f(Vp ) := 

| Vfj } |, which can be thought as the empirical probability of correct clustering. Alternatively, a 
measure of cluster separability, used extensively in pattern recognition, is Fisher’s discriminant 
ratio [2], which in the present context becomes 


K I< 


pR) _ pR) 

H-i 


fdrm : = y y .. „ .,. 

i=l k 2 = 1; Wfci ) + { a k 2 ) 


fci = 


k 2 ^k i 


where (cr[ r ^) 2 is the unbiased sample variance of cluster k: 


( 4 r) R= 


iR’i -1 


£ 




(r) 


— c 


R) 


( 6 ) 


(7) 


The larger the FDR^ r \ the more separable clusters are. Obtaining FDR^ r 7 is computationally light 
since distances in (J7|) have been calculated during the validation phase of the algorithm. The only 
additional burden is computing the numerator in ([HD in 0[(d + d')K 2 ] complexity. Based on FDR, 
a second choice for / is 



Instead of FDR^, the exponent is — 1/FDR^ to avoid pathological cases where FDR^ approaches 
-poo, e.g., when all points in a cluster are very concentrated so that (a^) 2 ~ 0. 

Alg. [T] incurs overall complexity 0(NKR max dI), where / is an upper bound on the number 
of iterations needed for A'-means to converge in step 0 plus Q(NKR rn ^ d') required in step [U] of 
Alg.Hl Parameters d, d', and A max are selected depending on the available computational resources; 
(d, d') should be such that running the computations of Alg. [Don {d + d^-dimensional vectors can 
be affordable by the processing unit used. A probabilistic argument for choosing A max can be 
determined as elaborated next. 
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Remark 2. Using parameters that can be obtained in practice, it is possible to relate the number 
of random draws i? max with the reliability of SkeVa-based clustering, along the lines of analyzing 
RANSAC [EE]. 

To this end, let p denote the probability of having out of R SkeVa realizations at least one 
“good draw” of d “informative” dimensions, meaning one for which /i-means yields data-cluster 
associations close to those found by A"-means on the full set of D dimensions. Parameter p is a 
function of the underlying cluster characteristics. It can be selected by the user and reflects one’s 
level of SkeVa-based big data clustering reliability, e.g., p := 0.95. The probability of having all 
“bad draws” after R SkeVa repetitions is clearly 1 — p. Moreover, let q denote the probability that 
a randomly drawn row of X is “informative.” In other words, q quantifies prior information on the 
number of rows (out of D) that carry high discriminative information for /i-means. For instance, 
q can be practically defined by the leverage scores of X, which typically rank the importance of 
rows of X in large-scale data analytics 0120]. An estimate of the leverage scores across the rows 
of X expresses q as the percentage of informative rows. Alternatively, if x n . = m^ + X/ v rij 
is the data generation mechanism per cluster k, with v n . ~ jV(0,i£>), then the zth entry of x nj 
is ejx nj ~ Af(ejmi t , [X^,],,). Thus, rows of X are realizations of a Gaussian 1 x N random 
vector. If these rows are clustered in K' groups, q can capture the probability of having an 
“informative” row located within a confidence region around its associated centroid that contains 
a high percentage a € (0,1) of its pdf mass. Per SkeVa realization, the probability of drawing 
d “non-informative” rows can be approximated by (1 — q) d . Due to the independence of SkeVa 
realizations, 1 — p = (1 — q) dR , which implies that R — log(l — p)/[d log(l — q)]. Clearly, as R 
increases there is (a growing) nonzero probability that the correct clusters will be revealed. On the 
other hand, it must be acknowledged that if R is not sufficient, clustering performance will suffer 
commensurately. 

It is interesting to note that as in [Ml, R only depends implicitly on D, since q is a percentage, 
and does not depend on the validation metric or pertinent thresholds and bounds. 

3.2 Sequential algorithm 

Drawing a batch of d! features (rows of JV) during the validation phase of Alg. [T] to assess the 
discriminating ability of the features drawn in the sketching phase may be computationally, espe¬ 
cially if d! is relatively large. The computation of all distances {6(xn\ c/^)} in step [6] of Alg. |T| 
can be prohibitive if d! becomes large. This motivates a sequential augmentation of dimensions, 
where features are added one at a time, and computations are performed on only a single row of X 
per feature augmentation, till the upper-bound d! is reached. Apparently, such an approach adds 
flexibility and effects computational savings since sequential augmentation of dimensions does not 
need to be carried out till d! is reached, but it may be terminated early on if a prescribed criterion 
is met. These considerations prompted the development of Alg. [21 

The sketching phase of Alg. [2] remains the same as in Alg. |T| In the validation phase, and for 
each dimension in the additional d! ones, {c/ are obtained as in Alg. [Tj [cf. (|4|)], and likewise 

for Vq. If is smaller than the current maximum value /max in memory, the /-dimensional 

clustering {c/' 1 } is discarded, and a new draw is taken. This can be seen as a “bail-out” test, to 
reject possibly “bad clusterings” in time, without having to perform the augmentation using all d! 
dimensions. 
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Algorithm 2 Sequential (Se)SkeVa K-means. 


for r = 1 to R r , 


do 


Input: Data X; number of clusters K ; reduced dimensions d and d! of reduced dimensions for 
the sketching and validation phases, respectively; ranking function /; number of realizations 
(draws) R max . 

Output: Data-cluster associations on X. 

1 : 

2 

3 

4 

5 

6 

7 

8 
9 

10 
11 
12 

13 

14 

15 

16 

17 

18 


Randomly sample d <C D rows of X to obtain X^ r K 
Run /v-means on X^ to obtain clusters and centroids 

Initialize the auxiliary set of dimensions A = 0. 
for j = 1 to d' do 

Randomly sample 1 dimension of X (other than those in step [2] and in A) to obtain row 
x( r ') of X. 

Include sampled dimension in A. 


Form {c\ : = 


Ar) r~(r)T xh'hTiA' 


1 1 \ K 

l Jfc=l 


as m 


Alg.HJ 


fc=i • 


Associate to closest {c ^}\L 

Identify validation set [cf. (J5])] . 

if /(V^) < /mix or |V/(V£ } )| < e then 
Go to step [2j 

end if 
end for 

A r i 1) = /(v£>). 

r* = r. 

end for 

Data-cluster associations on X according to 
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Figure 1: f(V^) vs. the number of augmented dimensions d! for a synthetic data-set, with D = 
2,000, N = 1,000, d = 50, and full-rank data-model (cf. Secs. 13.21 and l5lh 


Experiments corroborate that it is not necessary to augment all D — d dimensions (cf. Fig. [I]), 
but using a small subset of them provides satisfactory accuracy while reducing complexity. An 
alternative route is to stop augmentation once the “gradient” of /, meaning finite differences across 
augmented dimensions, drops below a prescribed e > 0; that is, |V/(V^)| < e. The sequential 
approach is summarized in Alg. [2] and has complexity strictly smaller than 0[NKR ma , x (dl + d')\. 

Remark 3. Using N in the place of D whenever D <C N, or equivalently, replacing X £ M DxN 
with X T , both the batch and sequential schemes developed for D S> can be implemented verbatim 
for N 3>. This variant of SkeVa K-means will be detailed in the next subsection for nonlinearly 
separable clusters. 


3.3 Big data kernel clustering 

A prominent approach to clustering or classifying nonlinearly separable data is through kernels; 
see e.g., [1]. Vectors {x n }^ =1 are mapped to W{x n )}^ =1 that live in a higher (possibly infinite-) 
dimensional space H , where inner products defining distances in using the induced norm || ■ || H : = 
(• | , are given by a pre-selected (reproducing) kernel function ft; that is, (tp(x n ) \ ip(x n f)) n = 

0. An example of such a kernel is the Gaussian one: K^(x n ,x) := exp [—{x — 
x n ) T ’E~ 1 (x - *n)/2]/[(27r) D / 2 (det S) 1 / 2 ]. 

For simplicity in exposition, our novel kernel-based (Ke)SkeVa K-means approach to big data 
clustering will be developed for the hard K -means. Extensions to kernel-based soft SkeVa K-means 
follow naturally, and are outlined in Appendix |A] Similar to (JT]) , the kernel-based hard ii-means 
proceeds as follows. For *£{1,2,..., /}, 


[*- a] Update data-cluster associations: For n = 1,... ,N, 

x n eC k [i\<^k€ argrnin \\<p(x n ) - c k \i]\\ 2 H 


(8a) 


9 



[i-b] Update cluster centroids: For k = 1,..., K, 


Cfc [i + 1] e argmin ) 

c£H ^ 


Xn£Ck [i 


1 


\ C M\ 


,11 E 


Xn eCk [i] 


,M®n) ~ cf n 

V{ x n) 


(8b) 


where Euclidean norms || ■ j| 2 in the standard form of K-means have been replaced by || • ||^. As the 
potentially infinite-size {cfc[i+l]}[L 1 cannot be stored in memory, step 18b[) is implicit. In fact, only 
k and the data-cluster associations suffice to run ([8]). To illustrate this, substitute {c^fz + 1]}* =1 
from (l8bl) into (IKal) to write 


tp( x n) 


1 


| Ck' \i + 1] | 


*-^x'eCu i+i 


n 


( l P( x n) | l f( x n))-}{ 
2 


(9) 


+ 


+ 


| Ck> 

[ <+1 iL 


1 

| Ck' 

[i+ HI 2 , 

*En-) • 

X n) 


1 

| Ck> 



n 


E M®**) I ¥>(®n)> 

\Cy [i+1] 

E^«) I v( x n))n 


■ E*(* 

<ec fc / [i+ 1 ] 


E K ( x n> x n)- 


( 10 ) 


Having established that distances involved in SkeVa K-means are expressible in terms of the 
chosen kernel k, the resulting iterative scheme is listed as Alg. [3j After randomly selecting an 
affordable subset X^ r \ comprising v columns of X per realization r, and similar to the trial-and- 
error step in line [3] of Alg. [H the (kernel) A"-means of ([8]) is applied to X^ r \ The validation phase 
of KeSkeVa K-means is initialized in line [4] where a second subset X^ r ) comprising v' columns 
from X \ X (r K The distances between data {<p(xn ^)} and centroids {cj^} involved in step [5] of 
Alg. [3] are also obtained through kernel evaluations [cf. ([TO]) ]. This KeSkeVa that operates on the 
number of data-points rather than dimensions follows along the line of RANSAC m but with two 
major differences: (i) Instead of robust parameter regression, it is tailored for big data clustering; 
and (ii) rather than consensus it deals with affordably small validation sets across possibly huge 
data-sets. 


During the validation phase, clusters Cl are specified according to x\-, 


£ Cl O k £ 


argmin fc , e r 1 ....) — cjj., ||^. Gathering all information from draws (r,r'), the augmented 


clusters c[P := c[P U CY 1 (step [5] of Alg. EJ) lead to centroids 


?(*•') 


•- 


|C 


A) I 




V{Xn)■ 


( 11 ) 


xGC 


(r) 
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Given the “implicit centroids” obtained as in dill) , data X( r ) are mapped to clusters {C^} which 


are different from C 


(r) 

k . To assess this difference, the distance between (p{x 


and c[ r is computed, 


and columns of X^ are re-grouped in clusters {C k }^ =1 as 


x 


(r) G Mr) 


4=> k E argmin 


<P 


i x n ’) - 


A r ) 

c k' 


H 


( 12 ) 


Recall that distances are again obtained through kernel evaluations [cf. ([TO]) ]. 

The process of generating clusters and centroids in KeSkeVa K-means can be summarized as 
follows: (i) Group randomly drawn data X^ into clusters with centroids c k ' > ; (ii) draw 
additional data-points, augment clusters C k \ and compute new centroids cjjT^; (iii) given c£\ find 
clusters C k 1 as in (1X21) . Since cj^ / c : k in general, data belonging to C k ’ > do not necessarily belong 
to C k \ and vice versa; while data common to and C k \ that is data that have not changed 
“cluster membership” during the validation phase, comprise the validation set 


V 


t ) : =k r) ei (r ) 3k s.t. 


N 




(13) 


Among i? max realizations, trial r* with the highest cardinality |vj^| is identified in Alg. [3l and data 
are finally associated with clusters The overall complexity of Alg. [3] is Q(DKR m ^ v 2 1) 

in step El when {n{x n ,x n i)}^ n , =1 are not stored in memory and kernel evaluations have to be 
performed for all employed data per realization, plus 0(DR max i'i'') in stepEJ If { n(x n , x n /)}^ n , =l 
are stored in memory, then Alg. [3] incurs complexity 0(KR max n 2 I + R max i'i''), which is quadratic 
only in the small cardinality A 


Algorithm 3 Kernel (Ke)SkeVa K-means 


Input: Data A; number of clusters K; number v and v' of data during sketching and validation 
phase, respectively; number of realizations R max . 

Output: Data-cluster associations on X. 


1 

2 

3 

4: 

5: 


for r = 1 to R max do 

Randomly sample v -C N columns of X to obtain X^ r ’. 

Apply A'-means [cf. (1201) and ©] on {ip(xn ^)}^=i; obtain clusters {C k > }u=i an d centroids 
/xW \ K 

\ c k Sk=l- 

Randomly select v' <C N columns of X , other than those of step [21 to obtain X( r '\ 
Associate {ip(xn ^)}n=i to the closest centroids obtain clusters and cen¬ 

troids (4 r) }f = i [cf. m- 


6: Obtain clusters {C^ ] } on X^ according to (fT2l) . 

7: Identify the validation set V y N . 

8 : end for 

9: r* = arg min rg{1 jj max} |V^ |. 

10: Associate {(p(x n )}^ =1 according to the closest centroids obtained from clusters {C k *^} k=l . 


One remark is now in order. 
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Remark 4. Similar to all kernel-based approaches, a critical issue is selecting the proper kernel 
- more a matter of art and prior information about the data. Nonetheless, practical so-termed 
multi-kernel approaches adopt a dictionary of kernels from which one or a few are selected to run 
if-means on [23] . It will be interesting to investigate whether such multi-kernel approaches can be 
adapted to our SkeVa-based operation to alleviate the dependence on a single kernel function. 


4 Divergence-Based SkeVa 

A limitation of Algs. [U and [2] is the trial-and-error strategy which requires clustering per random 
draw of samples. This section introduces a method to surmount such a need and select a small 
number of data or dimensions on which only a single clustering step is applied at the last stage of 
the algorithm. As the “quality” per draw is assessed without clustering, this approach trades off 
accuracy for reduced complexity. 


4.1 Large-scale data sets 


First, the case where random draws are performed on the N data will be examined, followed by 
draws across the D dimensions. Any randomly drawn data from {x n }^ =1 will be assumed centered 
around their sample mean. 

Since intermediate clustering will not be applied to assess the quality of a random draw, a 
metric is needed to quantify how well the randomly drawn samples represent clusters. To this end, 
motivated by the pdf mixture model [cf. p]>], consider the following pdf estimate formed using the 
randomly selected data 

P {r) (x) := K { x n\ x ) ( 14 ) 

n =1 


(r) 

where k stands for a user-defined kernel function, parameterized by x\ and satisfying 
f n(xn\x)dx = 1 for the estimate in ( 1141 ) to qualify as a pdf. Here, the Gaussian kernel 
Kj}(xn\x) ■= exp[—(* — Xn' > ) T H^ 1 (x — )/2]/[(27r) jC> / 2 (det X) 1 / 2 ] is considered with S := ct 2 Id- 

Having linked random samples with pdf estimates, the assessment whether a draw represents 
well the whole population {x n }^ =1 translates to how successful this draw is in estimating the actual 
data pdf via ( 1141 ) . A random sample where all selected data form a single cluster is clearly not a 
good representative of {x n }!^ =1 . For example, if the selected points are gathered around 0 (recall 
that drawn data are centered around their sample mean), then the resulting pdf estimate ( 1141 ) will 
resemble the uni-modal (thick) dashed curve in Fig. [2j Such a pdf estimate is a poor representative 
of the whole data-set, and clustering on that small population of data should be avoided. On the 
contrary, random draws yielding the multi-modal (thick) solid curve in Fig. [5] should be highly rated 
as potential candidates for clustering. As a first step toward assessing draws is a metric quantifying 
how “far” the pdf p ^ is from p(°\x) := k(0,x). 

Among candidate metrics of “distance” between pdfs, the Cauchy-Schwarz divergence m is 
chosen here: 


A C s(p (r) |l)P (r,) ) := 


( I p ( ' r \x)p( r '\x)dx) 

log ———---—- 

(x)] 2 dx f\p( r 'l(x)] 2 dx 
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Figure 2: Examples of pdf mixtures fitted to data-points. Reference pdf estimate is the uni-modal 
thick dashed curve; the larger the divergence from this estimate, the larger the probability of 
producing meaningful clustering. 


The reason for choosing Acs over other popular divergences, such as the Kullback-Leibler one, is 
the ease of obtaining pdf estimates via (fTTl) . Specifically, the numerator in Acs can be expressed 
as [cf. (fT4|) ] 


= ^i K K r) ’ *) K ( X n' ) > *) d: 

n=ln'=r 


(15) 


The right-hand-side of (1151) is simplified further if the chosen kernel is the Gaussian one. As the 
convolution of Gaussian pdfs is also Gaussian, it is not hard to verify that 


for which (|15l) becomes 


J K'£ (a4 r) , *) «£ (x„, \x)dx = k 2S (a?i r) , X%, ] ) 

f p( r \x)p( r \x)dx = *1 


where the v x v' matrix * has (n,n ')th entry \K < ££ ^] nn > := n{ x n\x^). It thus follows 


that 


A CS {P (r) \\P (rl) ) = - 2log 


+ log(^l T K^ ) l 




+ log ( T72 1 K 


L 2£ 


(16) 
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Algorithm 4 Divergence (Di)SkeVa K-means on N. 

Input: Data X; number of clusters K ; number v and v' of points for sketching and validation 
phases, respectively; number of realizations R m ax ; ^max — 0, A min — +OO. 

Output: Data-cluster associations. 

1: for r = 1 to X max do 

2: Let X( r ^ denote v randomly selected points after centered around their sample mean. 

3: if Acs||p (0) ) > A max then 

4: Let X^ r '^ denote v' randomly selected points, other than those in step 0 after centered 

around their sample mean. 

5: XW := [jWjM]. 

6: if Acs(p (r) \\P {r,) ) < A' min then 

7: AD n :=A C s(p (r) |l^). 

8: A max := A C s(p (r) ||p (0) ). 

9: r* := r. 

10: end if 

11: end if 

12: end for 

13: Perform A'-means on X^ r *^ and associate X to clusters obtained by A'-means on X^*^. 


Notice that when r' is simply {0}, the last summand in (1161) becomes log (0,0) = —D(log 2tt)/2— 
(log det 2Y,)/2. 

The metric in (1161) is computed per draw of the so-termed divergence-based DiSkeVa K-means 
summarized in Alg. |4j A number A max of realizations are attempted to discover a “good” draw 
r* of data, to which clustering is finally performed. Line [3] in Alg. [5] checks whether the randomly 
selected subset yield via (fl4l) a pdf p^ that differs enough from the “singular” pdf p(°^ = ke( 0, •). 
If the divergence exceeds A max , realization r will be further explored, otherwise r + 1 is drawn. 
Notice that threshold A max is adaptively defined and takes, according to line [H the maximum 
recorded value from realization r = 0 till the current one. 

If p^ passes the first check of being far from p(°\ lines |4] to [121 implement the second step of 
consenting whether r is indeed a “good” realization. To this end, a number of u' additional data- 
points is drawn to form X^ := [X^, X^] in line El The mixture pdf]/”- 1 corresponding to X^ r \ 
should stay as close as possible to p^ since reliable pdf estimates should remain approximately 
invariant as extra data are added. Drastic changes of Acs before and after augmentation suggests 
that the draw is likely not to be a good representative of the whole population. Notice here that 
AG n is also adaptively defined to take the minimum value among all recorded divergences from the 
start of iterations. Moreover, both updates of A( llin and A max are performed once the candidate 
draw r has passed through the “check-points” of lines [3] and [6l 

In the case where the Gaussian kernel is employed, Alg. [4] has overall complexity o[DR max i) 2 + 
DR ma , x (i'is' + f' /2 )], if the kernel matrix IX 2 S of all data {x n }^ =1 is not stored in memory and 
calculations of all kernel sub-matrices in (U6j) are performed per realization; plus, O(DuKI) for a 
single application of X-means on the finally selected draw r*. If IT 2 S is available in memory, then 
Alg. [4] incurs complexity o[X max z> 2 + A max (zzi> / + v' 2 ) + DuKI ], that is quadratic only in the small 
subset sizes. 

Remark 5. Along the lines of Remark [2l let p denote the probability of having out of R SkeVa 
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realizations at least one “good draw” of size i>, meaning one for which R-means yields centroids 
close to those found with the “full data-set.” Moreover, let q denote the probability of a datum 
to lie “close” to its associated centroid. For example, q can capture the probability of having a 
datum located within a confidence region which is centered at its associated centroid and contains 
a high percentage of its pdf mass. The probability of having all “bad draws” is clearly 1 — p. 
Assuming that data are drawn independently, the probability of having one draw contain only data 
located “far away” from centroids is (1 — q) v . Due to the independence of random draws in SkeVa, 
1 — p = (1 — q) uR , which implies that R ~ log(l — p)/[£Tog(l — q)]. Analogous to Remark El this 
argument neither involves N nor it depends on the validation metric or pertinent thresholds and 
bounds. 


4.2 High-dimensional data 

Alg. [3] remains operational also when DiSkeVa K-means deals with D 3>. Although the proposed 
scheme can be generalized to cope with both N and D 3>, for simplicity of exposition it will be 
assumed that only D ^$>. To this end, consider the following pdf estimate R d : 

1 N 

p (r \x) := —^2 k(x£\x) , VxeR d 

72=1 


(V) 

where Xn denotes a d x 1 subvector of the D x 1 vector x n . 

The counterpart of Alg. [4] on dimensions is listed as Alg. [5j Although along the lines of Alg. 0] 
there is a notable difference. In the validation step, where dimensions are increased (cf. line El) and 
a pdf estimate is needed for the augmented set of variables. To define divergence between pdfs of 
different dimensions, vectors have to be zero padded from the d-dimensional Xn to the ( d + d')- 
dimensional Xn' 1 '■= [*n^ T , 0 T ] T . Recall here that Xn^ := [xn^ T ,Xn ^ T ] T - To avoid confusion, pdf 
mixtures on these zero-padded vectors are given by 


N 


n( r ) 


(*) = 5Z' S (^n ) ’ X ) ’ V;Ee 


(17) 


n= 1 


Similar to Alg. 0] the overall complexity of Alg. [5] is o[(d + d')N 2 R max ] for computations in (fT6|). 
plus O(dNKI) for a single application of /F-means on the finally selected draw r*. 

Remark 6. If multi-core machines are also available, the validation phases of Algs. [1]|5] can be 
readily parallelized, using recent advances on efficient parallel computing platforms such as MapRe¬ 
duce mm- 


5 Numerical Tests 

We validated the proposed algorithms on synthetic and real data-sets. Tests involve either large 
number of data (N 3>) and/or large number of dimensions (D 3>). The following methods were 
also tested: (i) The standard hard JF-means [cf. (fT|) ], run on the full range of N data-points and D 
dimensions, which is abbreviated in the figures as “full A'-means”; (ii) the state-of-the-art RP-based 
feature-extraction scheme j5l Alg. 2], with a Bernoulli-type RP matrix as in jTj; (iii) the randomized 
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Algorithm 5 Divergence (Di)SkeVa K-means on D. 

Input: Data X; number of clusters K; number d and d' of dimensions for sketching and validation 
phases, respectively; number of realizations R m ax ; A max = 0, A/ lin = +oo. 

Output: Data-cluster associations. 

1: for r = 1 to -R max do 

2: Center randomly selected X^ around their sample mean to obtain X^ r \ 

3: if A C s(p (r) ||p (0) ) > A max then , 

4: Center randomly selected X^ r \ other than those in step El around their sample mean 

to obtain X^ r '\ 

5: Form iW := [lW T , X( r ') T ] T . 

6: Form := [X^ T ,0 T ] T and estimate pdf mixture [cf. (fl7l) ]. 

7: if Acs (p (r) II <? (r) ) < A G„ then 

8: AD n :=Acs(p W lk“ (r) )- 

9: A max := Acs(p (r) ||p (0) ). 

10: r* := r. 

11: end if 

12: end if 

13: end for 

14: Perform A-means on and associate X to clusters obtained by A-means on X^ r *\ 


feature-selection (RFS) algorithm [5j Alg. 1; e = 1/3], a leverage-scores-based scheme; and (iv) the 
“approximate kernel K-means” algorithm |7] Alg. 2], which solves for an “optimal” data-cluster 
association matrix given a randomly selected subset of the original data-points. For fairness, the 
naive kernel It-means algorithm in [7] Alg. 1] is not tested, because a random draw of data and the 
application of It-means is done only once in [TJ Alg. 1]; hence, the attractive attribute of multiple 
independent draws is not leveraged as in SkeVa. To mitigate initialization-dependent performance, 
each realization of A-means, including also its usage as a module in other competing methods, is 
run five times with different initialization per run, keeping finally only the data-clusters association 
that results with the smallest sum of distances of data from the associated centroids. 

As figures of merit we adopted the relative clustering accuracy and the execution time (in 
secs). Relative clustering accuracy is defined as the percentage of points assigned to the correct 
clusters (empirical probability of correct clustering), relative to that of (kernel) /F-means on the 
full data-set. Regarding computational time evaluations, tests in Sec. 15.II are run using Matlab [22] 
on a SunFire X4600 PC with a 32-core AMD Opteron 8356, clocked at 2.3GHz with 128GB RAM 
memory [24] . without the use of parallelization, on a single computational thread. Tests in Secs. 15.21 
15.31 and 15.41 are run on an HP ProLiant BL280c G6 server using 2 eight-core Sandy Bridge E5-2670 
processor chips (2.6GHz) and 128GB of RAM memory [23] • In the latter tests, algorithms were 
allowed to exploit MATLAB’s inherent multithread capabilities [24] on the 16 cores of the server. 
Moreover, all plotted curves are averages over 50 Monte Carlo realizations. 

To construct synthetic data, D x 1 vectors {x n }^ =1 were generated according to the following 
model per cluster k\ 

x nj =m k + T,l /2 v nj , j e {1,... ,N/K} (18) 

where it is assumed that N is an integer multiple of K, m k is the D x 1 mean (centroid) of 
cluster k, noise v n . ~ A/"(0, J^) is standardized Gaussian, and is the covariance matrix of the 
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Figure 3: Synthetic data (D = 2,000 and full-rank model). 


data generated for cluster k; hence, x nj ~ X&). Means {rn-fcj^Li are selected uniformly 

at random from a D-dimensional hypercube, as in [5]. To accommodate data-models with limited 
degrees of freedom, the “rank of data,” controlled by the number of non-zero eigenvalues of X*,, 
was used as a tuning parameter. In certain cases, clusters were well separated—a scenario where 
AT-means achieves relatively high clustering accuracy. Throughout this section A max = 10 except 
for the tests using the KDDb database [32] for which A max = 20. 

5.1 Large number of dimensions (D 3>) 

Tests cases in this subsection have D ^$> N. Competing methods are the “full JC-means,” RP 
[5] Alg. 2], and RFS |5j Alg. 1; e = 1/3]. Model (fTHI) was used to generate N = 1, 000 D-dimensional 
vectors for K = 5 clusters, for several values of D, and variable “data-rank.” It can be seen from 
Figs. [3] and [4] that Algs. Q] (SkeVa K-means) and [2] (SeSkeVa K-means) approach the accuracy 
of the full AT-means algorithm as the number of sampled dimensions d increases. As expected, 
computational time is significantly lower than that of full AT-means, since the latter operates on 
all D dimensions. Moreover, SeSkeVa K-means needs more time than RP [5] to achieve the same 
clustering accuracy (cf. Figs. [3] and [4]), since RP utilizes AT-means as a sub-module only once, 
after dimensionality-reduction has been effected by left-multiplication of X with a (fat) d x D 
RP matrix. However, this changes as D grows large. As Fig. [5] demonstrates, whenever D is 
massive, left-multiplying X by the RP d x D matrix in [5] can become cumbersome, resulting in 
computational times larger than those of SeSkeVa K-means. 

Fig. [6] shows results for the real data-set ARCENE [16], which contains mass-spectra of pa¬ 
tients diagnosed with cancer, as well as spectra associated with healthy individuals. Clustering 
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Figure 4: Synthetic data (D = 2,000 and rank equal to 500). 
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Figure 5: Synthetic data (D = 500, 000 and rank equal to 1, 000). 
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Figure 6: Simulated performance for real data-set ARCENE. 


involves grouping 100 (D = 10,000)-dimensional spectra in two clusters (K = 2). The number of 
augmented dimensions for all employed algorithms is d' = 100. All proposed algorithms approach 
the performance of RP and full /C-means, while requiring less time. Alg. [5] is the fastest one at a 
comparable performance. 

Fig. [7] depicts results for the real ORL database of 400 face-images, from 40 different subjects 
(10 each) [21128] , Images have size 92 x 112 with 8-bit grey levels, resulting in D = 10, 304. Only 30 
images (3 subjects) were used, and as such the task is to group these images into K = 3 clusters. 
As with the ARCENE data, the number of additional dimensions for all proposed algorithms is 
d! = 100. Algs. U 1 and [5] exhibit similar performance, while requiring much less time than the 
full JC-means. Again, Alg. 0 is the fastest one at a comparable performance. 

Tests were also performed on a subset of the KDDb 2010 data-set (K = 2, D = 2,990, 384) [32]. 
The version of the data-set is the one transformed by the winner of the KDD 2010 Cup (National 
Taiwan University). In each run 10,000 data-points were chosen randomly from both classes, and 
clustering was performed on this subset. The RFS performance is not reported in Fig. [8] as there 
were issues regarding memory usage due to the required SVD computations. Here, the number of 
augmented dimensions is d! = 1,000. All algorithms exhibit performance similar to full K -means; 
however Alg. |T] and Alg. [5] require significantly less time than all competing alternatives. 

It should be noted also that the required amount of memory per iteration for Algs. |T] and [2] is 
at most 0[N(d + d')\, in contrast with the competing algorithms whose memory requirements grow 
linearly with D. 
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Figure 7: Simulated performance for real data-set ORL. 
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Figure 8: Simulated performance for real data-set KDDb. 
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5.2 Large number of points (N 3>) 

Here we deal with N S> D. Alg.[5]is compared with the “full J\-means” and the “approximate kernel 
AT--means” algorithm [7]. N = 100,000 vectors with D = 5 were generated according to (1181) for 
K = 5 clusters. Although “approximate kernel AT-means” can accommodate nonlinearly separable 
clusters by using nonlinear kernel functions, the linear kernel was used here: k(x, y) := x T y, \/x, y. 
The Gaussian kernel function kj, with 5] := Id, was used in (fl4l) . Fig. [9]shows clustering accuracy 
across the number v of randomly selected data per draw. The number of additional points v' is 
set equal to 100. As Fig. [9] demonstrates, Alg. [2 approaches the performance of the full ii-means 
algorithm, even with z> = 100 sampled data, while requiring markedly lower execution time than 
both “full” and “approximate kernel AT-means.” 

5.3 Kernel clustering 

Nonlinearly separable data are mapped here using a prescribed kernel function to high-dimensional 
spaces to render them linearly separable. Algs. [3] and [2 with kernel AT-means applied only at the 
end, are compared with the “full kernel AT-means” and the “approximate kernel A'-means” j7]. 
Throughout, S := 5 Id was used in (fl4l) . Tests were performed on a subset (N = 35,000) of the 
MNIST-784 data-set, which contains 28 x 28 pixel images of handwritten digits grouped in K = 10 
clusters. The kernel used in this case is the sigmoid one k(x, y) = tanh(aa ; T y + b) with parameters 
a = 0.0045, b = 0.11, in accordance to [Mj. Both the sigmoid and the Gaussian (for the case of 
Alg. IH) kernels are considered stored in memory. Fig. [TO] depicts the relative clustering accuracy for 
this data-set and the required time in seconds. It is clear that accuracies of all three algorithms are 
close and approach the performance of “full kernel A'-means” as the number of sampled data-points 
increases. However, the time required by Algs. [3] and S] is significantly less than the time required 
by the “full” and “approximate kernel AT-means.” 

5.4 Exploiting multiple computational threads 

To showcase the scalability of the proposed algorithms in the presence of multiple computational 
nodes, the algorithms were run on multiple computational threads. The independent draws r of the 
proposed algorithms were executed in parallel. Moreover, competing algorithms were allowed to 
exploit MATLAB’s multithread capabilities, e.g., matrix-matrix multiplications in RP |21j . Figs. [TT] 
and 121 show simulation results for the ARCENE and KDDb data-sets, respectively. Clearly, par¬ 
allelization of the iterations on the proposed algorithms is beneficial since the algorithms exhibit 
about an order of magnitude less required time than that of competing methods. 

6 Conclusions and Future Research 

Inspired by RANSAC ideas that have well-appreciated merits for outlier-resilient regression, this 
paper introduced a novel algorithmic framework for clustering massive numbers of high-dimensional 
data. Several members of the proposed sketching and validation (SkeVa) family were introduced. 
The first two members, a batch and a sequential one tailored to streaming modes of operation, 
required A"-means clustering in low-dimensional spaces and/or a small number of data-points. 
To enable clustering of even nonlinearly separable data, a third member of the family leveraged 
the kernel trick to cluster linearly separable mapped data. A divergence metric was utilized to 
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Figure 9: Synthetic data (D = 5, N = 100, 000, K = 5). 
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Figure 10: A subset of the MNIST data-set. 
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Figure 11: A subset of the ARCENE data-set with multithreading. 
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Figure 12: A subset of the KDDb data-set with multithreading. 






























develop the fourth member of SkeVa K-means that bypasses intermediate /v-means clustering 
to trade-off accuracy for reduced complexity. Extensive numerical tests on synthetic and real 
data-sets demonstrated the competitive performance of SkeVa over state-of-the-art schemes based 
on random projections. Future research will focus on the rigorous performance analysis of the 
proposed framework, and on the application of SkeVa to spectral clustering and its MapReduce 
implementation. 

Appendices 

A Soft Kernel il-means 

Since this appendix deals with N ^>, the clustering schemes of Sec. [2] will be applied here to a 
reduced number v of D x 1 vectors. Let X^ € W Dxu denote the subset of data obtained by 
sketching columns of X. In this context, ([2|) proceeds as follows. For * = 1,2,..., 


[z-a] Update data-cluster associations: For n = 1,... , z>, 

7r n [z] G argmin 5 (ip(x^), Y\ n k c k \i}) + p( tt) . 

7T6[0,1] K v ' 

1 T 7T=1 


[i-b] Update cluster centroids: 


(19a) 


V 

{c k [i + l]} k=x G argmin V 5 
{c k }cH n=1 




(19b) 


Recall that given a kernel k, if <p(x) := k(x, ■), then inner products in "H can be obtained as kernel 
evaluations: (ip(x) \ ip(x')) H = (k(x,-) \ = k(x,x'), where (• | -) w denotes the inner 

product in %. Moreover, function 5 is chosen as 


& {^(x^),^2 k [7v n ] k c k ^ 


<P(Xn ) ) -^Mn] k C k , 


1 /2 

with || • ||^ := (■ | -)^ . It can be shown then by the Representer’s theorem [3] that due to the 
limited number of data {tp(xn^)}^= i, looking for a solution of (Il9bl) in % is equivalent to looking 
for one in the low-dimensional linear subspace := span{<£>(cc^),..., of rank < z>, 

and where span stands for the linear span of a set of vectors. For notational convenience, let 
<f«(A (r )) := [v?(a4 T) ), • • .,(p{x^)], and <f>(A( r ))b := YZ .=l b n <p{xn' ) ), for any b G MF. 

Centroid c k belongs to and can be expressed as a linear superposition of <3?(X^). 

Specifically, there exists b k G HU s.t. c k = ^(X^)b k . Upon defining B := [&i,..., bx], then 
[ci, ..., ck] = <&{X^)B. Moreover, 


6 (f{ x{ n) ’'52 k \ 7V n]kC k S j 


<p(xW)-*(X(r))Bn n 
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which can be also obtained through kernel evaluations. Letting denote the u x u kernel 

matrix with {n,n') th entry [K^] nn i := K,(xn\ x^}), and the i/ X 1 vector with nth entry 

[fe$ P ,] n := K(xn\x^)), it follows from the linearity of inner products that for any i> x 1 vector £, 

I ‘p(x i n ) )) n = £ T k^ r) and ($(!«)$ | As such, the quadratic 

term in (1201) becomes 


= (p{ x n ) ) <p{ x n ] )) n -2^(X (r) )S7T n 
+ ^(xW)S7T n | $(X (r) )B7T n )^ 

= n(4 r) , *i r) ) - 2 tt ^B T k% + ^B t K^B-k u 


a 


( 20 ) 


which shows that kernel AMneans in (|191) boils down to solving a finite-dimensional optimization 
task w.r.t. B and II := [7Ti,..., 7Tj>]. 

Moreover, distances in H between <p(xn G <h(A"( r ^) and centroids cj/* = ^(X^)b^ needed 
in step [5] of Alg. [3] can be efficiently computed because they are also expressible in terms of kernel 
evaluations as follows; 


<p(x£"l) - 4>(xM)&< r) 

= (»>«">) | *>«■'>))„ - 2(*(*M)&M 
+ (®(xW)(,0|$(JfC))60) M 
= - 26<’- )T fcW + . 


n 


References 

[1] D. Achlioptas, “Database-friendly random projections,” in Proc. of SIGMOD-SIGACT- 
SIGART. ACM, 2001, pp. 274-281. 

[2] AT&T Laboratories Cambridge, UK. The ORL database of faces. [Online]. Available: 
http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html 

[3] T. Bengtsson, P. Bickel, and B. Li, “Curse-of-dimensionality revisited: Collapse of the particle 
filter in very large scale systems,” in Probability and Statistics: Essays in Honor of David 
A. Freedman. IMS, 2008, vol. 2, pp. 316-334. 

[4] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006. 

[5] C. Boutsidis, A. Zouzias, M. Mahoney, and P. Drineas, “Randomized dimensionality 
reduction for A'-rneans clustering,” Computing Research Repository, 2011. [Online]. Available: 
arXiv: 1110.2897 


25 

















[6] C. Chatterjee and V. Roychowdhury, “On self-organizing algorithms and networks for class- 
separability features,” IEEE Trans. Neural Networks, vol. 8, no. 3, pp. 663-678, 1997. 

[7] R. Chitta, R. Jin, T. C. Havens, and A. K. Jain, “Approximate kernel AT-means: Solution to 
large-scale kernel clustering,” in Proc. of Knowledge Discovery Data Mining, San Diego CA: 
USA, Aug. 2011. 

[8] O. Chum and J. Matas, “Optimal randomized RANSAC,” IEEE Trans. Pattern Analysis and 
Machine Intelligence, vol. 30, no. 8, pp. 1472-1482, 2008. 

[9] K. L. Clarkson and D. P. Woodruff, “Low rank approximation and regression in input 
sparsity time,” in Proc. of Symposium on Theory of Computing, June 2013, pp. 81-90. 
[Online]. Available: arXiv:1207.6365v4 

[10] K. Cukier, “Data, data everywhere,” The Economist, 2010. [Online]. Available: 
http://www. economist .com/node/15557443. 

[11] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Comm. 
ACM, vol. 51, no. 1, pp. 107-113, 2008. 

[12] I. S. Dhillon, Y. Guan, and B. Kulis, “Kernel A'-means: Spectral clustering and normalized 
cuts,” in Proc. of SIGKDD Inti. Conf. Knowledge Discovery Data Mining. ACM, 2004, pp. 
551-556. 

[13] A. Elgohary, A. K. Farahat, M. S. Kamel, and F. Karray, “Embed and conquer: Scalable 
embeddings for kernel I\ -means on MapReduce,” in Proc. of Inti. Conf. Data Mining, Shenzen: 
China, Dec. 2014. 

[14] M. Fisher and R. Bolles, “Random sample consensus: A paradigm for model fitting with 
applications to image analysis and automated cartography,” Comm. ACM, vol. 24, pp. 381- 
395, June 1981. 

[15] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” J. Machine 
Learning Research, vol. 3, pp. 1157-1182, 2003. 

[16] I. Guyon, S. R. Gunn, A. Ben-Hur, and G. Dror, “Result analysis of the NIPS 2003 feature 
selection challenge,” in Proc. of Neural Info. Process. Systems, vol. 4, Whistler BC, Canada, 
2004, pp. 545-552. 

[17] M. Jordan, “On statistics, computation and scalability,” Bernoulli, vol. 19, no. 4, pp. 1378- 
1390, 2013. 

[18] B. Kang, W. Lim, and K. Jung, “Scalable kernel A'-means via centroid approximation,” in 
Proc. of NIPS, Granada: Spain, Dec. 2011. 

[19] M. Mahoney, “Randomized algorithms for matrices and data,” Found. Trends Machine Learn., 
vol. 3, no. 2, pp. 123-224, 2011. 

[20] — , “Algorithmic and statistical perspectives on large-scale data analysis,” in Combinatorial 
Scientific Computing, U. Naumann and O. Schenk, Eds. Chapman and Hall/CRC, 2012, 
ch. 16, pp. 427-459. [Online]. Available: arXiv:1010.1609vl 


26 




[21] Mathworks. Matlab multicore. [Online]. Available: 

ht t p: / / www. mathworks. com / discovery / matlab- multicore. lit ml 

[22] MATLAB, version 8.2.0.701 (R2013a). Natick MA: USA: The MathWorks Inc., 2013. 

[23] C. Micchelli and M. Pontil, “Learning the kernel function via regularization,” J. Machine 
Learning Research, vol. 6, pp. 1099-1125, Sept. 2005. 

[24] Minnesota Supercomputing Institute. [Online]. Available: http://www.msi.umn.edu/ 

[25] D. Nister, “Pre-emptive RANSAC for live structure and motion estimation,” Machine Vision 
Appl, vol. 16, no. 5, pp. 321-329, 2005. 

[26] V. M. Patel, H. V. Nguyen, and R. Vidal, “Latent space sparse subspace clustering,” in Proc. 
of ICCV, Sydney: Australia, 2013, pp. 225-232. 

[27] J. C. Principe, Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives. 
Springer, 2010. 

[28] F. Samaria and A. Harter, “Parameterization of a stochastic model for human face identifi¬ 
cation,” in Proc. of Workshop on Applications of Computer Vision, Sarasota FL: USA, Dec. 
1994. 

[29] R. Toldo and A. Fusiello, “Robust multiple structures estimation with J-linkage,” in Proc. of 
ECCV. Marseille: France: Springer, 2008, pp. 537-547. 

[30] P. H. Torr and A. Zisserman, “MLESAC: A new robust estimator with application to estimat¬ 
ing image geometry,” Computer Vision and Image Understanding, vol. 78, no. 1, pp. 138-156, 
2000 . 

[31] B. Yu and B. Yuan, “A more efficient branch and bound algorithm for feature selection,” 
Pattern Recognition, vol. 26, no. 6, pp. 883-889, 1993. 

[32] H.-F. Yu, H.-Y. Lo, H.-P. Hsieh, J.-K. Lou, T. G. Mckenzie, J.-W. Chou, P.-H. Chung, C.-H. 
Ho, C.-F. Chang, J.-Y. Weng, E.-S. Yan, C.-W. Chang, T.-T. Kuo, P. T. Chang, C. Po, C.-Y. 
Wang, Y.-H. Huang, Y.-X. Ruan, Y.-S. Lin, S.-D. Lin, H.-T. Lin, and C.-J. Lin, “Feature engi¬ 
neering and classifier ensemble for KDD cup 2010,” in Proc. ACM SIGKDD Conf. Knowledge 
Discovery Data Mining, Washington, DC, July 2011. 

[33] H. Zhang and G. Sun, “Feature selection using Tabu search method,” Pattern Recognition, 
vol. 35, pp. 701-711, 2002. 

[34] R. Zhang and A. I. Rudnicky, “A large scale clustering scheme for kernel AT-means,” in Proc. 
ICPR, vol. 4, Quebec, Canada, 2002, pp. 289-292. 

[35] M. Zuliani, C. S. Kenney, and B. S. Manjunath, “The multiRANSAC algorithm and its ap¬ 
plication to detect planar homographies,” in Proc. of ICIP, vol. 3, Genova: Italy, 2005, pp. 
Ill—153—6. 


27 




